In this mini-study, we examine the impact of Elon Musk's tweets on BTC, since he is one of the biggest crypto influencers on Twitter.
# Run the pip install command below if you don't already have the library
# for scraping tweets
# !pip install git+https://github.com/JustAnotherArchivist/snscrape.git
# for NLP
# !pip install google-cloud-language
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
import re
import os
import time
from datetime import datetime, timedelta
from textblob import TextBlob
from tqdm import tqdm
from google.cloud import language_v1
import matplotlib.pyplot as plt
import plotly
import plotly.offline as pyo
import plotly.express as px
from plotly.offline import plot
import plotly.graph_objects as go
from plotly.validators.scatter.marker import SymbolValidator
import bearishelon.config as cf
pyo.init_notebook_mode()
We used the snscrape library instead of Tweepy because a recent update to the Twitter Developer API terms limits the number of tweets that can be scraped. More details on how to use snscrape are included in the reference link below.
tweet_count = 10000
username = "elonmusk"
startdt = '2019-04-01'
enddt = '2021-05-22'
# Using OS library to call CLI commands in Python
os.system("snscrape --jsonl --max-results {} --since {} twitter-search 'from:{} until:{}' > user-tweets.json".format(tweet_count,startdt, username,enddt))
6696
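`os.system` interpolates the query into a shell string, which is fragile if any argument ever contains quotes or spaces. A sketch of the same call via `subprocess.run` with an argument list (the helper names here are illustrative, not from the notebook; assumes snscrape is installed and on PATH):

```python
import subprocess

def build_snscrape_cmd(username, since, until, max_results):
    # Argument list for snscrape's twitter-search scraper, avoiding
    # manual shell-string interpolation.
    return [
        "snscrape", "--jsonl", "--max-results", str(max_results),
        "--since", since,
        "twitter-search", f"from:{username} until:{until}",
    ]

def scrape_user_tweets(username, since, until, max_results, outfile="user-tweets.json"):
    with open(outfile, "w") as f:
        # check=True raises CalledProcessError on a non-zero exit code,
        # instead of silently returning it like os.system does.
        subprocess.run(build_snscrape_cmd(username, since, until, max_results),
                       stdout=f, check=True)
```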
tweets = pd.read_json('user-tweets.json', lines=True)
tweets = tweets[tweets.lang!='ru']
df = tweets[['date','renderedContent','retweetCount']]
df = df.rename(columns={'date': 'timestamp', 'renderedContent': 'tweet','retweetCount':'retweet_count'})
# import tweepy
# api_key = cf.TW_API_KEY
# api_secret = cf.TW_API_SECRET_KEY
# access_token = cf.TW_ACCESS_TOKEN
# access_token_secret = cf.TW_ACCESS_KEY
# auth = tweepy.OAuthHandler(api_key, api_secret)
# auth.set_access_token(access_token, access_token_secret)
# api = tweepy.API(auth,wait_on_rate_limit=True)
# username = 'elonmusk'
# tweets = tweepy.Cursor(api.user_timeline, id = username, tweet_mode='extended').items(200)
# tweet_dict = []
# for t in tqdm(tweets):
# dic = {}
# dic['created_at'] = t.created_at
# dic['retweet_count'] = t.retweet_count
# dic['tweet'] = t.full_text
# tweet_dict.append(dic)
#pd.set_option("display.max_colwidth", -1)
#df = pd.DataFrame.from_dict(tweet_dict)
After collecting the tweets, we keep only the necessary fields: the timestamp, the full text of the tweet, and the retweet count as a virality indicator.
Before processing the tweets, we also remove any mentions (handles) and hyperlinks that could confuse our NLP model. Hashtags are kept, as they may contain useful influencing keywords.
We then filter the actual content with crypto-related keywords. Unfortunately, the keyword 'moon' had to be left out because it often picked up news related to SpaceX instead of crypto phrases like 'to the moon'.
def clean_tweets(text):
"""Clean data - remove mentions and links, keeping hashtags and retweets (sometimes useful)"""
text = re.sub(r"@[A-Za-z0-9_]+", "", text) # Remove mentions - underscores are valid in handles
#text = re.sub(r"#", "", text) # Remove hashtag symbol
#text = re.sub(r"RT[\s]+", "", text) # Remove retweet marker
text = re.sub(r"https?:\/\/\S+", "", text) # Remove hyperlinks
return text
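A quick check of the two substitutions on a sample tweet (the handle and URL below are made up; the function is restated inline, with `_` added to the mention pattern since Twitter handles may contain underscores, so the snippet is self-contained):

```python
import re

def clean_tweets(text):
    """Strip @mentions and hyperlinks; hashtags and retweet markers are kept."""
    text = re.sub(r"@[A-Za-z0-9_]+", "", text)
    text = re.sub(r"https?:\/\/\S+", "", text)
    return text

sample = "@some_user Doge to the #moon! https://t.co/abc123"
print(clean_tweets(sample))  # -> " Doge to the #moon! "
```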
def sentiment_analysis(df):
sentiment = TextBlob(df["tweet"]).sentiment
return pd.Series([sentiment.polarity,sentiment.subjectivity])
df["tweet"] = df["tweet"].apply(clean_tweets)
keywords = ['doge','crypto','centralize','bitcoin','ethereum','💎','diamond','paper']
df['filter'] = df['tweet'].str.contains('|'.join(keywords), case=False) # case-insensitive match; per-keyword (?i) flags are deprecated mid-pattern
df = df[df['filter'] == True]
df.tail()
| | timestamp | tweet | retweet_count | filter |
|---|---|---|---|---|
| 6332 | 2019-04-30 01:15:53+00:00 | Ethereum | 9338 | True |
| 6557 | 2019-04-13 16:22:16+00:00 | Cryptocurrency is my safe word | 1741 | True |
| 6672 | 2019-04-02 20:38:38+00:00 | Dogecoin value may vary theonion.com/bitcoin-plunge… | 1943 | True |
| 6673 | 2019-04-02 20:16:58+00:00 | Dogecoin rulz | 16425 | True |
| 6675 | 2019-04-02 09:24:39+00:00 | _Heats Dogecoin might be my fav cryptocurrency. It’s pretty cool. | 3402 | True |
Using TextBlob, a sentiment score (TB_sscore) and a subjectivity score (TB_subjectivity) are generated.
TextBlob scoring
# Adding Subjectivity & Polarity
df[["TB_sscore", "TB_subjectivity"]] = df.apply(sentiment_analysis, axis=1)
To use the Google Cloud Natural Language API, you will need to create a Google Cloud account, create a new project, and download the JSON key file, noting its path. Next, enable the Cloud Natural Language API for that project (this requires a billing account, but the trial account comes with $300 of free credit for 90 days).
Scoring
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = cf.GCL_path
client = language_v1.LanguageServiceClient()
def googleNLPscore(x):
try:
document = language_v1.Document(content=x, type_=language_v1.Document.Type.PLAIN_TEXT)
sentiment = client.analyze_sentiment(document=document).document_sentiment
sscore = round(sentiment.score,4)
smag = round(sentiment.magnitude,4)
except Exception as e:
print(e)
sscore = 0
smag = 0
return pd.Series([sscore, smag])
df[["GCL_sscore", "GCL_smag"]] = df['tweet'].apply(googleNLPscore)
def scoreinplainEnglish(x):
if x > 0:
return "positive"
elif x == 0:
return "neutral"
else:
return "negative"
df['TB_sentiment'] = df['TB_sscore'].apply(scoreinplainEnglish)
df['GCL_sentiment'] = df['GCL_sscore'].apply(scoreinplainEnglish)
pd.set_option("display.max_colwidth", None) # -1 is deprecated in recent pandas
df.head()
| | timestamp | tweet | retweet_count | filter | TB_sscore | TB_subjectivity | GCL_sscore | GCL_smag | TB_sentiment | GCL_sentiment |
|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 2021-05-20 20:13:36+00:00 | Bitcoin hashing (aka mining) energy usage is starting to exceed that of medium-sized countries. Almost impossible for small hashers to succeed without those massive economies of scale. | 760 | True | -0.229167 | 0.625000 | 0.1 | 0.5 | negative | positive |
| 5 | 2021-05-20 17:56:35+00:00 | Yeah, I haven’t & won’t sell any Doge | 10320 | True | 0.000000 | 0.000000 | -0.5 | 0.5 | neutral | negative |
| 6 | 2021-05-20 17:53:25+00:00 | A longtime Tesla supporter gave me the Doge dollar sticker at Giga Berlin | 2477 | True | 0.000000 | 0.000000 | 0.1 | 0.1 | neutral | positive |
| 12 | 2021-05-20 13:31:24+00:00 | Currency is already digital! Decentralized crypto is an attempt to wrest power of currency dilution (pernicious form of taxation) & capital controls from governments. That said, I sure hope the cure is better than the disease!\n\nMars/AI are essential to passing the great filter/s. | 1197 | True | 0.385000 | 0.487778 | 0.1 | 1.3 | positive | positive |
| 18 | 2021-05-20 10:41:00+00:00 | How much is that Doge in the window? | 49661 | True | 0.200000 | 0.200000 | 0.0 | 0.0 | positive | neutral |
We import free XBTUSD minute data downloaded from http://api.bitcoincharts.com/v1/csv/ and compare it against the sentiments. We have also included an additional column containing a manual assessment of the tweets: we simply label each tweet +1 if it is deemed to have a positive impact on Bitcoin, or -1 if it has a negative impact.
btc = pd.read_csv('data/krakenUSD.csv', header = None, names = ['ts','spot','volume']) # Kraken trade data: unix timestamp, price, volume
btc.ts = pd.to_datetime(btc.ts, unit ='s')
btc = btc[btc.ts.dt.year >= 2019]
btc = btc.set_index('ts')
btc = btc.resample('1h').mean()
btc= btc.reset_index()
df = pd.read_csv('filtered_tweets.csv',parse_dates=['timestamp'])
df['timestamp'] = df['timestamp'].dt.tz_localize(None)
pd.options.plotting.backend = "plotly"
plot_from = '2019-01-01 00:00:00'
plot_to = '2021-05-21 00:00:00'
pos_tweets = df[(df['manual'] > 0) & (df['timestamp'] >= plot_from) & (df['timestamp'] <= plot_to)] # parentheses needed: & binds tighter than >
neg_tweets = df[(df['manual'] < 0) & (df['timestamp'] >= plot_from) & (df['timestamp'] <= plot_to)]
tmp = btc[(btc['ts'] >= plot_from) & (btc['ts'] <= plot_to)].reset_index(drop = True)
layout = go.Layout(title = 'BTCUSD')
fig = go.Figure(data = [go.Scatter(x=tmp["ts"], y=tmp["spot"], mode='lines')], layout= layout)
#positive tweets
x1 = pos_tweets['timestamp'].reset_index()
idx1=[]
for i in range(len(x1)):
idx1.append(tmp['ts'].sub(x1['timestamp'][i]).abs().idxmin())
test1 = tmp['spot'].iloc[idx1].tolist()
y1 = pd.DataFrame(test1,columns=['spot'])
fig.add_trace(go.Scatter(
x= x1['timestamp'],
y= y1['spot'],
mode = "markers",
text = "positive",
hoverinfo='text',
textposition='top center',
showlegend = False,
marker = dict(size = 10, color = 'green', symbol = 'triangle-up'),
textfont=dict(family='sans serif',
size=20,
color='#000000'
))
)
# negative tweets
x2 = neg_tweets['timestamp'].reset_index()
# find the nearest spot price to each tweet time
idx2=[]
for i in range(len(x2)):
idx2.append(tmp['ts'].sub(x2['timestamp'][i]).abs().idxmin())
test2 = tmp['spot'].iloc[idx2].tolist()
y2 = pd.DataFrame(test2,columns=['spot'])
fig.add_trace(go.Scatter(
x= x2['timestamp'],
y= y2['spot'],
mode = "markers",
text = "negative",
hoverinfo='text',
textposition='top center',
showlegend = False,
marker = dict(size = 10, color = 'red', symbol = 'triangle-down'),
textfont=dict(family='sans serif',
size=20,
color='#000000'
))
)
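The nearest-timestamp loops above scan the whole price series once per tweet; `pd.merge_asof` with `direction='nearest'` does the same matching in one vectorized call. A sketch with toy data, reusing the notebook's `ts`/`spot`/`timestamp` column names:

```python
import pandas as pd

# Toy hourly BTC prices and two tweet timestamps.
btc = pd.DataFrame({
    "ts":   pd.to_datetime(["2021-05-20 10:00", "2021-05-20 11:00", "2021-05-20 12:00"]),
    "spot": [43000.0, 43500.0, 42800.0],
})
tweets = pd.DataFrame({
    "timestamp": pd.to_datetime(["2021-05-20 10:41", "2021-05-20 11:50"]),
})

# Both frames must be sorted on their time keys for merge_asof.
matched = pd.merge_asof(
    tweets.sort_values("timestamp"),
    btc.sort_values("ts"),
    left_on="timestamp", right_on="ts",
    direction="nearest",  # pick the closest price in either direction
)
print(matched["spot"].tolist())  # [43500.0, 42800.0]
```

10:41 is closer to the 11:00 bar than the 10:00 bar, and 11:50 is closer to 12:00, so each tweet is paired with its nearest spot price.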
Returns over three holding durations (30 minutes, 1 hour, and 2 hours) are also computed. Transaction costs are not factored in.
# gains/losses by 30 mins, 1 hour and 2 hours.
def computePnL(x):
# compute 30-min, 1-hr and 2-hr returns
# get the spot price nearest to the tweet time, then at each horizon
spot_start = btc['spot'].iloc[btc['ts'].sub(x).abs().idxmin()]
spot_30 = btc['spot'].iloc[btc['ts'].sub(x+timedelta(minutes=30)).abs().idxmin()]
spot_60 = btc['spot'].iloc[btc['ts'].sub(x+timedelta(hours=1)).abs().idxmin()]
spot_120 = btc['spot'].iloc[btc['ts'].sub(x+timedelta(hours=2)).abs().idxmin()]
return_30 = spot_30/spot_start-1
return_60 = spot_60/spot_start-1
return_120 = spot_120/spot_start-1
return pd.Series([return_30, return_60, return_120])
df[['30mins_return','1hr_return','2hr_return']] = df['timestamp'].apply(computePnL)
df.tail()
| | Unnamed: 0 | timestamp | tweet | retweet_count | filter | TB_sscore | TB_subjectivity | GCL_sscore | GCL_smag | TB_sentiment | GCL_sentiment | manual | 30mins_return | 1hr_return | 2hr_return |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 82 | 6332 | 2019-04-30 01:15:53 | Ethereum | 9338 | True | 0.0 | 0.000 | 0.3 | 0.3 | neutral | positive | 1.0 | 0.000319 | 0.000319 | -0.001120 |
| 83 | 6557 | 2019-04-13 16:22:16 | Cryptocurrency is my safe word | 1741 | True | 0.5 | 0.500 | 0.4 | 0.4 | positive | positive | NaN | 0.001730 | 0.001730 | 0.002333 |
| 84 | 6672 | 2019-04-02 20:38:38 | Dogecoin value may vary theonion.com/bitcoin-plunge… | 1943 | True | 0.0 | 0.000 | -0.3 | 0.3 | neutral | negative | -1.0 | 0.000000 | 0.000394 | 0.017245 |
| 85 | 6673 | 2019-04-02 20:16:58 | Dogecoin rulz | 16425 | True | 0.0 | 0.000 | 0.3 | 0.3 | neutral | positive | 1.0 | 0.003989 | 0.003989 | 0.004384 |
| 86 | 6675 | 2019-04-02 09:24:39 | _Heats Dogecoin might be my fav cryptocurrency. It’s pretty cool. | 3402 | True | 0.3 | 0.825 | 0.9 | 1.8 | positive | positive | 1.0 | 0.001526 | 0.001526 | 0.003516 |
# check for correct or incorrect predictions
# any value below 0 means the return sign and the prediction sign differ
df['30mins_dir'] = (df['30mins_return']*df['manual'] )
df['1hr_dir'] = (df['1hr_return']*df['manual'] )
df['2hr_dir'] = (df['2hr_return']*df['manual'] )
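The sign-product check can be verified on toy values (the numbers below are illustrative, not from the dataset):

```python
import pandas as pd

toy = pd.DataFrame({
    "1hr_return": [0.01, -0.02, 0.005],
    "manual":     [1,    1,     -1],
})
toy["1hr_dir"] = toy["1hr_return"] * toy["manual"]
# A positive product means the price moved with the labelled sentiment
# (a "win"); a negative product means it moved against it.
print((toy["1hr_dir"] > 0).tolist())  # [True, False, False]
```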
fig, (ax1,ax2,ax3,ax4) = plt.subplots(4,1,figsize =(15,15))
ax1.scatter(df['timestamp'],df['30mins_dir'])
ax1.axhline(y=0.0, color='r', linestyle='--')
ax2.scatter(df['timestamp'],df['1hr_dir'])
ax2.axhline(y=0.0, color='r', linestyle='--')
ax3.scatter(df['timestamp'],df['2hr_dir'])
ax3.axhline(y=0.0, color='r', linestyle='--')
ax1.set_ylabel('Prediction direction (opposite dir. if < 0)')
ax4.scatter(df['timestamp'],df['retweet_count'],color='red')
ax4.set_ylabel('Retweet counts')
We notice that prior to 2021, Elon Musk tweeted far less about cryptocurrency than in 2021. Before 2021, there are more data points below 0, meaning the price went in the opposite direction of the sentiment the tweet implied, and that skewed the overall win rate. Based on tweet data from 2021 onwards (when Elon Musk tended to be more vocal and active in tweeting about crypto), we would in fact have had a >60% chance of winning within 1-2 hours by trading on his tweets.
# filter 2021 and later
df = df[df.timestamp.dt.year >=2021]
test1 = df['30mins_dir'] >0
test2 = df['1hr_dir'] >0
test3 = df['2hr_dir'] >0
test2.value_counts()/len(test2)
True     0.661017
False    0.338983
Name: 1hr_dir, dtype: float64
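The same win rate can be computed for all three horizons in one expression (a sketch; the toy `_dir` values below stand in for the real columns):

```python
import pandas as pd

df = pd.DataFrame({
    "30mins_dir": [0.01, -0.02, 0.005, 0.003],
    "1hr_dir":    [0.02, -0.01, 0.004, -0.001],
    "2hr_dir":    [0.03,  0.01, -0.002, 0.002],
})
# Win rate per horizon = share of tweets whose sign product is positive.
win_rates = (df[["30mins_dir", "1hr_dir", "2hr_dir"]] > 0).mean()
print(win_rates)
```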
Although the downside of the Cloud Natural Language API is that it is a paid service, it does outperform TextBlob. It was able to interpret emoji well, but many tweets (including sarcastic ones) were still incorrectly interpreted. This could also be because the context of the reply chains was not provided.
We also manually computed the 30-min, 1-hr and 2-hr PnL of BTC immediately after each tweet was posted, and found no correlation between the GCL sentiment score (or our manual assessment of the sentiment) and the actual PnL. To use this as an entry/exit signal, we would need to improve both the prediction model's reading of tweets and the way we interpret the whole chain of conversation. However, because these tweets don't happen very often, we could in fact trade them discretionarily.
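The "no correlation" observation can be checked with a simple Pearson correlation between the score and the realised return (a sketch; the numbers below are toy stand-ins for the real columns):

```python
import pandas as pd

toy = pd.DataFrame({
    "GCL_sscore": [0.1, -0.5, 0.9, 0.3, -0.3],
    "1hr_return": [0.002, 0.004, -0.001, 0.000, 0.003],
})
# Series.corr computes the Pearson correlation by default; values near 0
# indicate no linear relationship between sentiment score and return.
corr = toy["GCL_sscore"].corr(toy["1hr_return"])
```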
Looking at his retweet counts, the tweets he made in 2021 appear more viral and could potentially 'influence' the market more. Filtering for 2021 data alone, we also found higher win rates (more than 60%) if we were to trade on his tweets with holding durations between 30 minutes and 2 hours.